15 research outputs found

    Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

    Full text link
    We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes to finite sum optimization problems. First, we show that, perhaps surprisingly, the finite sum structure by itself is not sufficient for obtaining a complexity bound of $\tilde{O}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly convex individual functions; one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an 'accelerated' complexity bound of $\tilde{O}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$ unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and convex finite sums, the optimal complexity bound is $\tilde{O}(n+L/\epsilon)$, assuming that (on average) the same update rule is used in every iteration, and $\tilde{O}(n+\sqrt{nL/\epsilon})$ otherwise.
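
    The methods in question are incremental schemes such as SVRG, which exploit exactly the index information discussed in the first result: the oracle reveals which component $f_i$ it evaluated. Below is a minimal, illustrative SVRG-style sketch on least-squares components; the quadratic components, step size, and epoch count are my assumptions, not the paper's setting.

```python
# Minimal SVRG sketch for f(w) = (1/n) * sum_i 0.5 * (x_i @ w - y_i)**2 (illustrative).
import numpy as np

def svrg(X, y, step=0.01, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()                             # snapshot ("anchor") point
        full_grad = X.T @ (X @ w_snap - y) / n        # one full gradient per epoch
        for _ in range(n):
            i = rng.integers(n)                       # the oracle reveals index i
            g_i = X[i] * (X[i] @ w - y[i])            # component gradient at w
            g_snap_i = X[i] * (X[i] @ w_snap - y[i])  # same component at the snapshot
            w -= step * (g_i - g_snap_i + full_grad)  # variance-reduced step
    return w

# Usage on synthetic noiseless data: the iterate approaches the true weights.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
print(np.linalg.norm(svrg(X, X @ w_true) - w_true))
```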

    Dimension-Free Iteration Complexity of Finite Sum Optimization Problems

    Full text link
    Many canonical machine learning problems boil down to a convex optimization problem with a finite sum structure. However, whereas much progress has been made in developing faster algorithms for this setting, the inherent limitations of these problems are not satisfactorily addressed by existing lower bounds. Indeed, current bounds focus on first-order optimization algorithms, and only apply in the often unrealistic regime where the number of iterations is less than $\mathcal{O}(d/n)$ (where $d$ is the dimension and $n$ is the number of samples). In this work, we extend the framework of (Arjevani et al., 2015) to provide new lower bounds, which are dimension-free, and go beyond the assumptions of current bounds, thereby covering standard finite sum optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as stochastic coordinate-descent methods, such as SDCA and accelerated proximal SDCA.
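
    For concreteness, here is a minimal SAGA-style sketch, one of the standard finite sum methods covered by such lower bounds; it maintains a table of the most recently computed component gradients. The least-squares components and step size are illustrative assumptions.

```python
# Minimal SAGA sketch for f(w) = (1/n) * sum_i 0.5 * (x_i @ w - y_i)**2 (illustrative).
import numpy as np

def saga(X, y, step=0.01, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    table = np.zeros((n, d))         # last computed gradient of each component
    table_avg = table.mean(axis=0)   # running average of the table
    for _ in range(iters):
        i = rng.integers(n)
        g_i = X[i] * (X[i] @ w - y[i])
        w -= step * (g_i - table[i] + table_avg)  # unbiased, variance-reduced direction
        table_avg += (g_i - table[i]) / n         # update the average before the table
        table[i] = g_i
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
print(np.linalg.norm(saga(X, X @ w_true) - w_true))
```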

    Communication Complexity of Distributed Convex Learning and Optimization

    Full text link
    We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. We identify cases where existing algorithms are already worst-case optimal, as well as cases where room for further improvement is still possible. Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.
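
    As a point of reference for what a 'communication round' means here, the toy sketch below runs synchronous distributed gradient descent, where each round consists of every machine sending one local gradient and receiving the averaged iterate; the quadratic local objectives are an illustrative assumption.

```python
# Toy round-based distributed gradient descent (illustrative setup).
import numpy as np

def distributed_gd(local_As, local_bs, rounds=200, step=0.05):
    w = np.zeros_like(local_bs[0])
    for _ in range(rounds):                       # one communication round per iteration
        # Machine m holds f_m(w) = 0.5 * w @ A_m @ w - b_m @ w, with gradient A_m w - b_m.
        grads = [A @ w - b for A, b in zip(local_As, local_bs)]
        w -= step * np.mean(grads, axis=0)        # server averages and broadcasts the step
    return w

# Two machines with different local quadratics.
A1, A2 = np.diag([1.0, 10.0]), np.diag([10.0, 1.0])
b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(distributed_gd([A1, A2], [b1, b2]))
```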

    On Lower and Upper Bounds in Smooth Strongly Convex Optimization - A Unified Approach via Linear Iterative Methods

    Full text link
    In this thesis we develop a novel framework to study smooth and strongly convex optimization algorithms, both deterministic and stochastic. Focusing on quadratic functions, we are able to examine optimization algorithms as a recursive application of linear operators. This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials, whereby new lower and upper bounds are derived. In particular, we present a new and natural derivation of Nesterov's well-known Accelerated Gradient Descent method by employing simple 'economic' polynomials. This rather natural interpretation of AGD contrasts with earlier ones, which lacked a simple yet solid motivation. Lastly, whereas existing lower bounds are only valid when the dimensionality scales with the number of iterations, our lower bound holds in the natural regime where the dimensionality is fixed. Comment: A related paper co-authored with Shai Shalev-Shwartz and Ohad Shamir is to be published soon.
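
    For reference, this is the standard constant-momentum form of Nesterov's Accelerated Gradient Descent specialized to a strongly convex quadratic. The thesis derives the method through the analytic theory of polynomials rather than through this pseudocode, so the sketch only fixes notation.

```python
# AGD on 0.5 * w @ A @ w - b @ w with constant momentum (illustrative).
import numpy as np

def agd_quadratic(A, b, iters=200):
    eigs = np.linalg.eigvalsh(A)
    L, mu = eigs.max(), eigs.min()                        # smoothness, strong convexity
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)  # momentum coefficient
    w = np.zeros_like(b)
    y = w.copy()
    for _ in range(iters):
        w_next = y - (A @ y - b) / L                      # gradient step at extrapolated point
        y = w_next + beta * (w_next - w)                  # extrapolation (momentum)
        w = w_next
    return w

A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
print(np.linalg.norm(agd_quadratic(A, b) - np.linalg.solve(A, b)))
```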

    Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

    Full text link
    Second-order methods, which utilize gradients as well as Hessians to optimize a given function, are of major importance in mathematical optimization. In this work, we prove tight bounds on the oracle complexity of such methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy. In particular, these bounds indicate when such methods can or cannot improve on gradient-based methods, whose oracle complexity is much better understood. We also provide generalizations of our results to higher-order methods. Comment: 35 pages; added discussion of matching upper bounds, and generalization to higher-order methods.
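
    To make the oracle model concrete: a second-order method queries the gradient and Hessian at each iterate. The sketch below shows a plain (undamped) Newton iteration on a regularized logistic-loss objective, which is an illustrative assumption rather than the worst-case construction analyzed in the paper.

```python
# Newton's method on ridge-regularized logistic loss (illustrative second-order oracle calls).
import numpy as np

def newton_logistic(X, y, iters=10, ridge=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                       # sigmoid(Xw)
        grad = X.T @ (p - y) / n + ridge * w                   # first-order information
        H = (X.T * (p * (1 - p))) @ X / n + ridge * np.eye(d)  # second-order information
        w -= np.linalg.solve(H, grad)                          # full Newton step
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.5 * rng.standard_normal(500) > 0).astype(float)
print(newton_logistic(X, y))
```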

    A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

    Full text link
    We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: in fact, the error can be as bad as that of non-delayed gradient descent run on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.
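
    A minimal sketch of the delayed-update model on a quadratic, where the step at round $t$ uses the gradient evaluated at the iterate from $\tau$ rounds earlier; the step-size choice and the diagonal quadratic are illustrative assumptions.

```python
# Gradient descent with delayed gradients on 0.5 * w @ A @ w - b @ w (illustrative).
import numpy as np

def delayed_gd(A, b, tau=5, iters=2000, step=None):
    L = np.linalg.eigvalsh(A).max()
    step = step or 1.0 / (L * (tau + 1))   # smaller steps compensate for the delay
    iterates = [np.zeros(b.shape[0])]
    for t in range(iters):
        w_old = iterates[max(t - tau, 0)]  # iterate from tau rounds ago (or the start)
        grad = A @ w_old - b               # delayed gradient
        iterates.append(iterates[-1] - step * grad)
    return iterates[-1]

A = np.diag([1.0, 5.0, 25.0])
b = np.ones(3)
print(np.linalg.norm(delayed_gd(A, b) - np.linalg.solve(A, b)))
```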

    Symmetry & critical points for a model shallow neural network

    Full text link
    We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ hidden neurons, where labels are assumed to be generated by a (teacher) neural network. We leverage the rich symmetry exhibited by such models to identify various families of critical points and express them as power series in $k^{-1/2}$. These expressions are then used to derive estimates for several related quantities which imply that not all spurious minima are alike. In particular, we show that while the loss function at certain types of spurious minima decays to zero like $k^{-1}$, in other cases the loss converges to a strictly positive constant. The methods used depend on symmetry, the geometry of group actions, bifurcation, and Artin's implicit function theorem.
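
    The objective in question is the population squared loss of a student two-layer ReLU network against a teacher network under Gaussian inputs. The sketch below estimates this loss by Monte Carlo sampling; the finite-sample approximation, dimensions, and unit second-layer weights are illustrative assumptions.

```python
# Monte Carlo estimate of the teacher-student squared loss for a two-layer ReLU net (illustrative).
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def student_teacher_loss(W_student, W_teacher, n_samples=10000, seed=0):
    rng = np.random.default_rng(seed)
    d = W_teacher.shape[1]
    X = rng.standard_normal((n_samples, d))             # x ~ N(0, I_d)
    y_teacher = relu(X @ W_teacher.T).sum(axis=1)       # labels generated by the teacher
    y_student = relu(X @ W_student.T).sum(axis=1)
    return 0.5 * np.mean((y_student - y_teacher) ** 2)  # squared loss

k, d = 4, 6
rng = np.random.default_rng(1)
W_teacher = rng.standard_normal((k, d))
print(student_teacher_loss(W_teacher, W_teacher))       # ~0 at the global minimum
print(student_teacher_loss(rng.standard_normal((k, d)), W_teacher))
```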

    On the Principle of Least Symmetry Breaking in Shallow ReLU Models

    Full text link
    We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the structure of spurious local minima detected by stochastic gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of symmetry} with respect to the target weights. A closer look at the analysis indicates that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic non-product distributions, smooth activation functions and networks with a few layers.

    Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry

    Full text link
    We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ neurons. We leverage the rich symmetry structure to analytically characterize the Hessian and its spectral density at various families of spurious local minima. In particular, we prove that for standard $d$-dimensional Gaussian inputs with $d \ge k$: (a) of the $dk$ eigenvalues corresponding to the weights of the first layer, $dk - O(d)$ concentrate near zero; (b) $\Omega(d)$ of the remaining eigenvalues grow linearly with $k$. Although this phenomenon of an extremely skewed spectrum has been observed many times before, to the best of our knowledge, this is the first time it has been established rigorously. Our analytic approach uses techniques, new to the field, from symmetry breaking and representation theory, and carries important implications for our ability to argue about statistical generalization through local curvature.
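
    Numerically, the almost-everywhere Hessian of the empirical squared loss in the first-layer weights reduces to an averaged outer product of per-sample output gradients, since ReLU is piecewise linear and its second derivative vanishes almost everywhere. The sketch below computes this spectrum at an arbitrary point, whereas the paper characterizes it analytically at specific spurious minima; the dimensions and sample size are illustrative assumptions.

```python
# Spectrum of the a.e. Hessian of the empirical squared loss of a two-layer ReLU net (illustrative).
import numpy as np

def relu_hessian_spectrum(W, X):
    n, d = X.shape
    k = W.shape[0]
    act = (X @ W.T > 0).astype(float)                     # per-neuron activation indicators
    # Per-sample gradient of the output sum_j relu(w_j . x) w.r.t. vec(W):
    # entries 1[w_j . x > 0] * x, flattened to length k*d.
    G = (act[:, :, None] * X[:, None, :]).reshape(n, k * d)
    H = G.T @ G / n                                       # a.e. Hessian of the squared loss
    return np.linalg.eigvalsh(H)                          # ascending eigenvalues

rng = np.random.default_rng(0)
k, d, n = 5, 20, 20000
W = rng.standard_normal((k, d))
eigs = relu_hessian_spectrum(W, rng.standard_normal((n, d)))
print("smallest:", np.round(eigs[:3], 4), "largest:", np.round(eigs[-3:], 4))
```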

    On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

    Full text link
    Recent advances in randomized incremental methods for minimizing $L$-smooth $\mu$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{nL/\mu})\log(1/\epsilon))$ and $O(n+\sqrt{nL/\epsilon})$, where $\mu>0$ and $\mu=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $\Omega(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$ upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible with stochastic oracles}, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/\mu})\log(1/\epsilon))$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tilde{\Omega}(n^2+\sqrt{nL/\mu}\log(1/\epsilon))$ and $\tilde{\Omega}(n^2+\sqrt{nL/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
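
    The distinction driving these bounds is between an incremental oracle, which reveals the index of the component it evaluated, and a stochastic oracle, which does not. The toy interfaces below illustrate the difference on least-squares components; the class names and setup are illustrative assumptions, not the paper's formal definitions.

```python
# Toy contrast between index-revealing (incremental) and index-hiding (stochastic) oracles.
import numpy as np

class IncrementalOracle:
    """Returns (i, grad f_i(w)) for a caller-chosen or random index i."""
    def __init__(self, X, y, seed=0):
        self.X, self.y = X, y
        self.rng = np.random.default_rng(seed)
    def query(self, w, i=None):
        i = self.rng.integers(len(self.y)) if i is None else i
        return i, self.X[i] * (self.X[i] @ w - self.y[i])

class StochasticOracle:
    """Returns only an unbiased gradient estimate; the index stays hidden."""
    def __init__(self, X, y, seed=0):
        self.X, self.y = X, y
        self.rng = np.random.default_rng(seed)
    def query(self, w):
        i = self.rng.integers(len(self.y))
        return self.X[i] * (self.X[i] @ w - self.y[i])

X = np.random.default_rng(1).standard_normal((10, 3))
y = np.zeros(10)
i, g = IncrementalOracle(X, y).query(np.ones(3))  # index is exposed
g_only = StochasticOracle(X, y).query(np.ones(3)) # index stays hidden
```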